89 research outputs found
Bilingual distributed word representations from document-aligned comparable data
We propose a new model for learning bilingual word representations from non-parallel document-aligned data. Following the recent advances in word representation learning, our model learns dense real-valued word vectors, that is, bilingual word embeddings (BWEs). Unlike prior work on inducing BWEs, which relied heavily on parallel sentence-aligned corpora and/or readily available translation resources such as dictionaries, the article reveals that BWEs may be learned solely on the basis of document-aligned comparable data without any additional lexical resources or syntactic information. We present a comparison of our approach with previous state-of-the-art models for learning bilingual word representations from comparable data that rely on the framework of multilingual probabilistic topic modeling (MuPTM), as well as with distributional local context-counting models. We demonstrate the utility of the induced BWEs in two semantic tasks: (1) bilingual lexicon extraction, (2) suggesting word translations in context for polysemous words. Our simple yet effective BWE-based models significantly outperform the MuPTM-based and context-counting representation models from comparable data as well as prior BWE-based models, and achieve the best reported results on both tasks for all three tested language pairs. This work was done while Ivan Vulić was a postdoctoral researcher at Department of Computer Science, KU Leuven supported by the PDM Kort fellowship (PDMK/14/117). The work was also supported by the SCATE project (IWT-SBO 130041) and the ERC Consolidator Grant LEXICAL: Lexical Acquisition Across Languages (648909)
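To make the bilingual lexicon extraction task concrete, here is a minimal sketch of how translations can be read off a shared bilingual embedding space: each source word is paired with the target word whose vector has the highest cosine similarity. The abstract does not state the retrieval procedure, so cosine nearest-neighbour search is an assumption here. The function names and the two-dimensional toy vectors are hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def induce_lexicon(
    source_vecs: Dict[str, List[float]],
    target_vecs: Dict[str, List[float]],
) -> Dict[str, str]:
    """Map each source word to its nearest target word by cosine similarity.

    Both dictionaries are assumed to hold vectors from the SAME shared
    bilingual space (hypothetical toy data below).
    """
    return {
        src: max(target_vecs, key=lambda tgt: cosine(vec, target_vecs[tgt]))
        for src, vec in source_vecs.items()
    }


# Hypothetical toy embeddings, for illustration only.
english = {"dog": [0.9, 0.1], "house": [0.1, 0.95]}
dutch = {"hond": [0.85, 0.2], "huis": [0.05, 0.9]}

lexicon = induce_lexicon(english, dutch)
```

With these toy vectors, `lexicon` pairs "dog" with "hond" and "house" with "huis". Real BWE spaces have hundreds of dimensions and large vocabularies, where an approximate nearest-neighbour index would usually replace the exhaustive search.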
Automatic detection and correction of context-dependent dt-mistakes using neural networks
We introduce a novel approach to correcting context-dependent dt-mistakes, one of the most frequent spelling errors in the Dutch language. We show that by using a neural network to estimate the probability distribution of a verb's suffix conditioned jointly on its stem and context, we obtain large improvements over state-of-the-art spell checkers on three different benchmarking datasets, achieving a perfect score on a verb spelling test from de Standaard, a Flemish newspaper. The method is unsupervised and only relies on basic preprocessing tools to tokenize the text and identify verbs, which enables training on millions of sentences. Furthermore, we propose a method to determine which words in a sentence cause the system to make corrections, which is valuable for providing feedback to the user
Learning unsupervised multilingual word embeddings with incremental multilingual hubs
Recent research has discovered that a shared bilingual word embedding space can be induced by projecting monolingual word embedding spaces from two languages using a self-learning paradigm without any bilingual supervision. However, it has also been shown that for distant language pairs such fully unsupervised self-learning methods are unstable and often get stuck in poor local optima due to reduced isomorphism between starting monolingual spaces. In this work, we propose a new robust framework for learning unsupervised multilingual word embeddings that mitigates the instability issues. We learn a shared multilingual embedding space for a variable number of languages by incrementally adding new languages one by one to the current multilingual space. Through the gradual language addition our method can leverage the interdependencies between the new language and all other languages in the current multilingual hub/space. We find that it is beneficial to project more distant languages later in the iterative process. Our fully unsupervised multilingual embedding spaces yield results that are on par with the state-of-the-art methods in the bilingual lexicon induction (BLI) task, and simultaneously obtain state-of-the-art scores on two downstream tasks: multilingual document classification and multilingual dependency parsing, outperforming even supervised baselines. This finding also accentuates the need to establish evaluation protocols for cross-lingual word embeddings beyond the omnipresent intrinsic BLI task in future work
Bilingual lexicon induction by learning to combine word-level and character-level representations
We study the problem of bilingual lexicon induction (BLI) in a setting where some translation resources are available, but unknown translations are sought for certain, possibly domain-specific terminology. We frame BLI as a classification problem for which we design a neural-network-based classification architecture composed of recurrent long short-term memory and deep feed-forward networks. The results show that word- and character-level representations each improve state-of-the-art results for BLI, and the best results are obtained by exploiting the synergy between these word- and character-level representations in the classification model
Multi-Modal Representations for Improved Bilingual Lexicon Learning
Recent work has revealed the potential of using visual representations for bilingual lexicon learning (BLL). Such image-based BLL methods, however, still fall short of linguistic approaches. In this paper, we propose a simple yet effective multimodal approach that learns bilingual semantic representations that fuse linguistic and visual input. These new bilingual multi-modal embeddings display significant performance gains in the BLL task for three language pairs on two benchmarking test sets, outperforming linguistic-only BLL models using three different types of state-of-the-art bilingual word embeddings, as well as visual-only BLL models. This work is supported by ERC Consolidator Grant LEXICAL (648909) and KU Leuven Grant PDMK/14/117. SC is supported by ERC Starting Grant DisCoTex (306920)
Do Meio- and Macrobenthic Nematodes Differ in Community Composition and Body Weight Trends with Depth?
Nematodes occur regularly in macrobenthic samples but are rarely identified from them and are thus considered exclusively a part of the meiobenthos. Our study compares the generic composition of nematode communities and their individual body weight trends with water depth in macrobenthic (>250/300 µm) samples from the deep Arctic (Canada Basin), Gulf of Mexico (GOM) and the Bermuda slope with meiobenthic samples (<45 µm) from GOM. The dry weight per individual (µg) of all macrobenthic nematodes combined showed an increasing trend with increasing water depth, while the dry weight per individual of the meiobenthic GOM nematodes showed a trend to decrease with increasing depth. Multivariate analyses showed that the macrobenthic nematode community in the GOM was more similar to the macrobenthic nematodes of the Canada Basin than to the GOM meiobenthic nematodes. In particular, the genera Enoploides, Crenopharynx, Micoletzkyia, Phanodermella were dominant in the macrobenthos and accounted for most of the difference. Relative abundance of non-selective deposit feeders (1B) significantly decreased with depth in macrobenthos but remained dominant in the meiobenthic community. The occurrence of a distinct assemblage of bigger nematodes of high dry weight per individual in the macrobenthos suggests the need to include nematodes in macrobenthic studies
Contribution of Distinct Homeodomain DNA Binding Specificities to Drosophila Embryonic Mesodermal Cell-Specific Gene Expression Programs
Homeodomain (HD) proteins are a large family of evolutionarily conserved transcription factors (TFs) having diverse developmental functions, often acting within the same cell types, yet many members of this family paradoxically recognize similar DNA sequences. Thus, with multiple family members having the potential to recognize the same DNA sequences in cis-regulatory elements, it is difficult to ascertain the role of an individual HD or a subclass of HDs in mediating a particular developmental function. To investigate this problem, we focused our studies on the Drosophila embryonic mesoderm where HD TFs are required to establish not only segmental identities (such as the Hox TFs), but also tissue and cell fate specification and differentiation (such as the NK-2 HDs, Six HDs and identity HDs (I-HDs)). Here we utilized the complete spectrum of DNA binding specificities determined by protein binding microarrays (PBMs) for a diverse collection of HDs to modify the nucleotide sequences of numerous mesodermal enhancers to be recognized by either no or a single subclass of HDs, and subsequently assayed the consequences of these changes on enhancer function in transgenic reporter assays. These studies show that individual mesodermal enhancers receive separate transcriptional input from both I-HD and Hox subclasses of HDs. In addition, we demonstrate that enhancers regulating upstream components of the mesodermal regulatory network are targeted by the Six class of HDs. Finally, we establish the necessity of NK-2 HD binding sequences to activate gene expression in multiple mesodermal tissues, supporting a potential role for the NK-2 HD TF Tinman (Tin) as a pioneer factor that cooperates with other factors to regulate cell-specific gene expression programs.
Collectively, these results underscore the critical role played by HDs of multiple subclasses in inducing the unique genetic programs of individual mesodermal cells, and in coordinating the gene regulatory networks directing mesoderm development. National Institutes of Health (U.S.) (Grant R01 HG005287)
Optimal deployment of components of cloud-hosted application for guaranteeing multitenancy isolation
One of the challenges of deploying multitenant cloud-hosted services that are designed to use (or be integrated with) several components is how to implement the required degree of isolation between the components when there is a change in the workload. Achieving the highest degree of isolation implies deploying a component exclusively for one tenant, which leads to high resource consumption and running cost per component. A low degree of isolation allows sharing of resources, which could reduce cost, but with known limitations of performance and security interference. This paper presents a model-based algorithm together with four variants of a metaheuristic that can be used with it, to provide near-optimal solutions for deploying components of a cloud-hosted application in a way that guarantees multitenancy isolation. When the workload changes, the model-based algorithm solves an open multiclass queueing network (QN) model to determine the average number of requests that can access the components and then uses a metaheuristic to provide near-optimal solutions for deploying the components. Performance evaluation showed that the obtained solutions had low variability and percent deviation when compared to the reference/optimal solution. We also provide recommendations and best practice guidelines for deploying components in a way that guarantees the required degree of isolation
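As a rough illustration of the queueing step mentioned in this abstract, the sketch below solves an open multiclass queueing network with the standard textbook formulas for load-independent queueing resources. The utilization of resource k by class c is U[c][k] = λ_c · D[c][k], and the mean number of class-c requests at resource k is N[c][k] = U[c][k] / (1 − Σ_j U[j][k]). The function name, arrival rates and service demands are hypothetical. The paper's own model parameters and its metaheuristic are not reproduced here.

```python
from typing import List


def solve_open_multiclass_qn(
    arrival_rates: List[float],   # lambda_c: requests per second for each class (assumed values)
    demands: List[List[float]],   # D[c][k]: seconds of service class c needs at resource k
) -> List[List[float]]:
    """Return N[c][k], the mean number of class-c requests at resource k.

    Assumes every resource is a load-independent queueing resource and that
    the network is stable (total utilization of each resource is below 1).
    """
    n_classes = len(arrival_rates)
    n_resources = len(demands[0])

    # Per-class utilization: U[c][k] = lambda_c * D[c][k]
    util = [
        [arrival_rates[c] * demands[c][k] for k in range(n_resources)]
        for c in range(n_classes)
    ]

    # Total utilization of each resource, summed over all classes
    total_util = [
        sum(util[c][k] for c in range(n_classes)) for k in range(n_resources)
    ]
    if any(u >= 1.0 for u in total_util):
        raise ValueError("Unstable network: a resource has utilization >= 1")

    # Mean queue length: N[c][k] = U[c][k] / (1 - total utilization of k)
    return [
        [util[c][k] / (1.0 - total_util[k]) for k in range(n_resources)]
        for c in range(n_classes)
    ]


# Hypothetical workload: two request classes, two resources.
queue_lengths = solve_open_multiclass_qn(
    arrival_rates=[2.0, 1.0],
    demands=[[0.1, 0.2], [0.3, 0.1]],
)
```

In this toy workload both resources are 50% utilized, so each class's queue length is twice its own utilization. A deployment search could call such a function each time the workload changes and pass the result to the optimization step.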
Interstitial lung disease in children - genetic background and associated phenotypes
Interstitial lung disease in children represents a group of rare chronic respiratory disorders. There is growing evidence that mutations in the surfactant protein C gene play a role in the pathogenesis of certain forms of pediatric interstitial lung disease. Recently, mutations in the ABCA3 transporter were found as an underlying cause of fatal respiratory failure in neonates without surfactant protein B deficiency. Especially in familial cases or in children of consanguineous parents, genetic diagnosis provides a useful tool to identify the underlying etiology of interstitial lung disease. The aim of this review is to summarize and to describe in detail the clinical features of hereditary interstitial lung disease in children. The knowledge of gene variants and associated phenotypes is crucial to identify relevant patients in clinical practice
Radiation chemistry of solid-state carbohydrates using EMR
We review our research of the past decade towards identification of radiation-induced radicals in solid-state sugars and sugar phosphates. Detailed models of the radical structures are obtained by combining EPR and ENDOR experiments with DFT calculations of g and proton HF tensors, with agreement in their anisotropy serving as the most important criterion. Symmetry-related and Schonland ambiguities, which may hamper such identification, are reviewed. Thermally induced transformations of initial radiation damage into more stable radicals can also be monitored in the EPR (and ENDOR) experiments and in principle provide information on stable radical formation mechanisms. Thermal annealing experiments reveal, however, that radical recombination and/or diamagnetic radiation damage is also quite important. Analysis strategies are illustrated with research on sucrose. Results on dipotassium glucose-1-phosphate and trehalose dihydrate, fructose and sorbose are also briefly discussed. Our study demonstrates that radiation damage is strongly regio-selective and that certain general principles govern the stable radical formation
- …